
    Folding: reporting instantaneous performance metrics and source-code references

    Although supercomputers deliver huge computational power, applications reach only a fraction of it. Several factors limit application performance, and one of the most important is single-processor efficiency, because it ultimately dictates the overall achieved performance. We present the folding mechanism, a process that combines measurements captured through minimal instrumentation with coarse-grain sampling while keeping time dilation low (less than 5%). By taking advantage of the repetitiveness of many applications, especially in HPC, the mechanism accurately reports instantaneous performance metrics and source-code references for optimized binaries. It enables the exploration of application performance and guides the analyst to source-code modifications.
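
    The folding step itself can be sketched compactly. The C program below illustrates one plausible reading of the idea, assuming each sample carries a timestamp and a counter increment and that iteration boundaries are known from the instrumentation; the names, bin count, and sample layout are illustrative assumptions, not the tool's implementation.

        /* sketch: fold sparse samples from many iterations of a repetitive
           region onto one normalized "synthetic" iteration */
        #include <stdio.h>

        #define BINS 10

        typedef struct {
            double time;     /* absolute timestamp of the sample            */
            double counter;  /* e.g. instructions completed since last read */
        } sample_t;

        /* iter_start[i]..iter_start[i+1] delimits iteration i */
        static void fold(const sample_t *s, int ns,
                         const double *iter_start, int niter,
                         double bin_sum[BINS], int bin_cnt[BINS])
        {
            for (int b = 0; b < BINS; b++) { bin_sum[b] = 0.0; bin_cnt[b] = 0; }
            for (int i = 0; i < ns; i++) {
                for (int it = 0; it < niter; it++) {
                    double t0 = iter_start[it], t1 = iter_start[it + 1];
                    if (s[i].time >= t0 && s[i].time < t1) {
                        /* normalized position of the sample in its iteration */
                        double phase = (s[i].time - t0) / (t1 - t0);
                        int b = (int)(phase * BINS);
                        if (b >= BINS) b = BINS - 1;
                        bin_sum[b] += s[i].counter;
                        bin_cnt[b]++;
                        break;
                    }
                }
            }
        }

        int main(void)
        {
            /* two iterations of a region, three sparse samples each */
            double iter_start[] = { 0.0, 1.0, 2.0 };
            sample_t s[] = { {0.1, 5}, {0.5, 9}, {0.9, 3},
                             {1.2, 6}, {1.6, 8}, {1.8, 4} };
            double sum[BINS]; int cnt[BINS];
            fold(s, 6, iter_start, 2, sum, cnt);
            for (int b = 0; b < BINS; b++)
                if (cnt[b]) printf("phase %.1f: avg counter %.1f\n",
                                   (b + 0.5) / BINS, sum[b] / cnt[b]);
            return 0;
        }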

    The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

    The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance-evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular-memory-access-pattern workloads, and its popularity and acceptance within the HPC community are therefore rising. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared-memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Given the increasing importance of the Arm architecture in HPC, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to boost the tuning of the HPCG benchmark within the Arm community.
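
    As an illustration of the kind of shared-memory parallelization involved, the C sketch below applies an OpenMP loop-level parallelization to an HPCG-style sparse matrix-vector product. It is not one of the report's two methods, and the data layout and names are assumptions; it only shows why the row-parallel kernels are easy targets, while kernels with dependencies (e.g. SymGS) need colouring or blocking schemes.

        /* OpenMP parallel sparse matrix-vector product, HPCG-style rows */
        #include <stdio.h>
        #include <omp.h>

        typedef struct {
            int      nrows;
            int     *row_nnz;   /* nonzeros per row                  */
            int    **col_idx;   /* column indices, one array per row */
            double **values;    /* matrix values, one array per row  */
        } sparse_t;

        void spmv(const sparse_t *A, const double *x, double *y)
        {
            /* rows are independent, so a plain parallel-for is safe here */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < A->nrows; i++) {
                double sum = 0.0;
                for (int j = 0; j < A->row_nnz[i]; j++)
                    sum += A->values[i][j] * x[A->col_idx[i][j]];
                y[i] = sum;
            }
        }

        int main(void)
        {
            /* tiny example: A = [[2,0],[1,3]], x = [1,1]  =>  y = [2,4] */
            int nnz[] = { 1, 2 };
            int c0[] = { 0 }, c1[] = { 0, 1 };
            double v0[] = { 2.0 }, v1[] = { 1.0, 3.0 };
            int    *cols[] = { c0, c1 };
            double *vals[] = { v0, v1 };
            sparse_t A = { 2, nnz, cols, vals };
            double x[] = { 1.0, 1.0 }, y[2];
            spmv(&A, x, y);
            printf("y = [%g, %g]\n", y[0], y[1]);
            return 0;
        }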

    On the roles of the programmer, the compiler and the runtime system when programming accelerators in OpenMP

    OpenMP includes the accelerator model in its latest 4.0 specification. In this paper we present a partial implementation of this specification in the OmpSs programming model, developed at the Barcelona Supercomputing Center, with the aim of identifying the roles the programmer, the compiler, and the runtime system should play to facilitate the asynchronous execution of tasks on architectures with multiple accelerator devices and processors. The design of OmpSs is strongly biased toward delegating most decisions to the runtime system, which, based on the task graph built at runtime (depend clauses), can schedule tasks in a data-flow manner to the available processors and accelerator devices and orchestrate data transfers and reuse among multiple address spaces. For this reason our implementation is partial, considering only those 4.0 directives that allow the compiler to generate the so-called “kernels” to be executed on the target device. Several extensions to the current specification are also presented, such as the specification of tasks in “native” CUDA and OpenCL, and how to specify the device and data privatization in the target construct. Finally, the paper discusses some challenges found in code generation and presents a preliminary performance evaluation with some kernel applications.
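
    For readers unfamiliar with the accelerator model, the minimal C sketch below shows the flavor of the OpenMP 4.0 offload directives involved: the compiler outlines the annotated region as a device kernel, and the map clauses describe the transfers the runtime must orchestrate. The array names and sizes are illustrative, and this is generic OpenMP 4.0, not the OmpSs implementation itself.

        /* minimal OpenMP 4.0 accelerator-model offload */
        #include <stdio.h>

        #define N 1024

        int main(void)
        {
            double a[N], b[N], c[N];
            for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

            /* the region below becomes a generated device "kernel";
               map clauses drive the host<->device data transfers */
            #pragma omp target map(to: a, b) map(from: c)
            #pragma omp teams distribute parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];

            printf("c[N-1] = %f\n", c[N - 1]);
            return 0;
        }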

    A trace-scaling agent for parallel application tracing

    Tracing and performance-analysis tools are an important component in the development of high-performance applications. Tracing parallel programs with current tracing tools, however, easily leads to trace files of hundreds of megabytes. The storage, visualization, and analysis of such trace files is often difficult. We propose a trace-scaling agent for tracing parallel applications that learns the application behavior at runtime and produces a small, easy-to-handle trace. The agent dynamically identifies the amount of information needed to capture the application behavior. This knowledge, acquired at runtime, allows recording only the non-iterative trace information, which drastically reduces the size of the trace file.
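
    A minimal sketch of the idea, under the assumption that iterative behavior can be detected by fingerprinting the event stream of each iteration: once an iteration repeats the previous fingerprint, the agent stops emitting records, so only non-iterative behavior reaches the trace. The hashing scheme and callback names are illustrative, not the agent's actual design.

        /* suppress trace records for iterations that repeat earlier ones */
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t hash = 1469598103934665603ULL;   /* FNV-1a state */
        static uint64_t prev_hash = 0;
        static int      recording = 1;

        /* called for every trace event inside the current iteration */
        void on_event(uint32_t event_id)
        {
            hash ^= event_id;
            hash *= 1099511628211ULL;
            if (recording)
                printf("trace: event %u\n", event_id); /* stand-in write */
        }

        /* called at each iteration boundary (e.g. main-loop back edge) */
        void on_iteration_end(void)
        {
            if (hash == prev_hash)
                recording = 0;    /* behaviour repeats: stop recording  */
            else
                recording = 1;    /* new behaviour: record it again     */
            prev_hash = hash;
            hash = 1469598103934665603ULL;
        }

        int main(void)
        {
            for (int it = 0; it < 4; it++) {   /* 4 identical iterations */
                for (uint32_t e = 1; e <= 3; e++)
                    on_event(e);
                on_iteration_end();
            }
            return 0;  /* only the first two iterations reach the trace */
        }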

    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    In this paper we propose a runtime-based selective task-replication technique for task-parallel high-performance computing applications. Our technique is automatic and does not require modification or recompilation of the OS, the compiler, or the application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective-replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. This work was supported by an FI-DGR 2013 scholarship, by the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
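
    A hedged sketch of a selective-replication heuristic in this spirit: tasks contribute an estimated error rate (FIT), and the largest contributors are replicated first until the unprotected remainder meets the application's target. The rate model and all names below are assumptions for illustration, not the paper's exact App_FIT heuristic.

        /* replicate highest-FIT tasks until residual rate meets target */
        #include <stdio.h>
        #include <stdlib.h>

        typedef struct { int id; double fit; int replicated; } task_t;

        static int by_fit_desc(const void *a, const void *b)
        {
            double d = ((const task_t *)b)->fit - ((const task_t *)a)->fit;
            return (d > 0) - (d < 0);
        }

        /* mark tasks for replication until residual FIT <= target_fit */
        int select_replicas(task_t *t, int n, double target_fit)
        {
            double residual = 0.0;
            int nrep = 0;
            for (int i = 0; i < n; i++) residual += t[i].fit;
            qsort(t, n, sizeof *t, by_fit_desc);
            for (int i = 0; i < n && residual > target_fit; i++) {
                t[i].replicated = 1;   /* its errors become detectable */
                residual -= t[i].fit;
                nrep++;
            }
            return nrep;
        }

        int main(void)
        {
            task_t t[] = { {0, 4.0, 0}, {1, 1.0, 0}, {2, 0.5, 0}, {3, 2.5, 0} };
            int n = select_replicas(t, 4, 2.0);
            printf("replicated %d of 4 tasks\n", n);   /* 2: tasks 0 and 3 */
            return 0;
        }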

    Simulation environment for studying overlap of communication and computation

    Overlapping communication and computation allows both the processors and the network to be utilized concurrently, leading to two clear benefits: overall speedup and a reduction in network performance requirements. Still, it remains unclear how much overlap can actually be achieved in practice, in real-world applications. This work designs a precise simulation environment that measures how much a scientific MPI application can profit from overlapping communication and computation. The simulation takes into account a wide range of application properties and allows studying overlap on a configurable platform. Additionally, the environment can visualize the simulated time behaviors, so non-overlapped and overlapped executions can be compared both quantitatively and qualitatively, providing new insights into the mechanism and potential of overlap. We found that the overlap potential is strongly limited by the pattern in which an application computes on the communicated data. Finally, we identified the biggest benefit of overlap to be that it can greatly relax network constraints without consequently degrading performance.
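
    The pattern being simulated is the classic non-blocking MPI overlap shown below: communication is initiated early, computation that does not depend on the in-flight message proceeds, and the wait is deferred until the data is actually needed. The ring exchange and buffer sizes are illustrative.

        /* non-blocking MPI send/recv overlapped with independent work */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000

        static double work(const double *a, int n)  /* independent of the */
        {                                           /* in-flight message  */
            double s = 0.0;
            for (int i = 0; i < n; i++) s += a[i] * 1.000001;
            return s;
        }

        int main(int argc, char **argv)
        {
            static double sendbuf[N], recvbuf[N], local[N];
            int rank, size;
            MPI_Request req[2];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int right = (rank + 1) % size, left = (rank + size - 1) % size;

            MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

            double s = work(local, N);    /* overlapped with the transfer */

            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            s += work(recvbuf, N);  /* only now safe to use received data */

            if (rank == 0) printf("checksum %f\n", s);
            MPI_Finalize();
            return 0;
        }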

    Methodology to predict scalability of parallel applications

    On the road to exascale computing, inferring the expected performance of parallel applications is a complex task. Performance analysts need to identify the behavior of applications and extrapolate it to machines that do not yet exist. In this work, we present a methodology based on collecting essential knowledge about the fundamental factors of parallel codes and analyzing in detail the behavior of the application at low core counts on current platforms. The result is a guide to generating the model that best predicts performance at very large scale. Results obtained from executions at low core counts showed the expected parallel efficiencies with a low relative error.
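
    As a toy illustration of the extrapolation step (not the paper's methodology, which collects richer fundamental factors), the C sketch below fits the serial fraction of Amdahl's law, S(p) = 1 / (f + (1 - f)/p), from speedups measured at low core counts and predicts efficiency at scale; all measurement numbers are made up.

        /* fit Amdahl's serial fraction at low core counts, extrapolate */
        #include <stdio.h>

        /* invert Amdahl's law: f = (p/S - 1) / (p - 1) */
        static double serial_fraction(double p, double speedup)
        {
            return (p / speedup - 1.0) / (p - 1.0);
        }

        int main(void)
        {
            double cores[]   = { 2, 4, 8, 16 };           /* low core counts   */
            double speedup[] = { 1.98, 3.88, 7.4, 13.7 }; /* measured speedups */
            double f = 0.0;
            int n = 4;

            for (int i = 0; i < n; i++)
                f += serial_fraction(cores[i], speedup[i]);
            f /= n;                                       /* averaged estimate */

            for (double p = 1024; p <= 65536; p *= 4) {
                double s = 1.0 / (f + (1.0 - f) / p);
                printf("p = %6.0f  predicted speedup %8.1f  efficiency %.3f\n",
                       p, s, s / p);
            }
            return 0;
        }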

    Bio-inspired call-stack reconstruction for performance analysis

    The correlation of performance bottlenecks with their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance, and ultimately enables optimizations of the code. To this end, performance-analysis tools collect the processor call-stack and combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack at run-time to diminish the collection cost, but at the price of non-portable solutions. In this paper, we present a novel, portable approach to associate performance issues with their source-code counterparts. We capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach map easily to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several in-production applications, analyzed here for the first time, to characterize them in detail and to optimize them with small modifications based on the analyses. We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum Jülich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts no. TIN2012-34557 and TIN2015-65316-P.
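
    The pairwise core of such alignment techniques can be sketched compactly: treating call-stack segments as sequences of frame identifiers, a Needleman-Wunsch global alignment scores how well two segments match. The scoring values and frame ids below are illustrative, and the paper's multi-sequence algorithm generalizes beyond this pairwise sketch.

        /* Needleman-Wunsch alignment of two call-stack frame sequences */
        #include <stdio.h>

        #define MAXN     32
        #define MATCH     2
        #define MISMATCH -1
        #define GAP      -2

        static int max3(int a, int b, int c)
        {
            int m = a > b ? a : b;
            return m > c ? m : c;
        }

        /* global alignment score of two frame-id sequences */
        int align(const int *a, int la, const int *b, int lb)
        {
            int dp[MAXN + 1][MAXN + 1];
            for (int i = 0; i <= la; i++) dp[i][0] = i * GAP;
            for (int j = 0; j <= lb; j++) dp[0][j] = j * GAP;
            for (int i = 1; i <= la; i++)
                for (int j = 1; j <= lb; j++)
                    dp[i][j] = max3(
                        dp[i-1][j-1] + (a[i-1] == b[j-1] ? MATCH : MISMATCH),
                        dp[i-1][j] + GAP,
                        dp[i][j-1] + GAP);
            return dp[la][lb];
        }

        int main(void)
        {
            /* two sampled 3-level call-stack segments, one frame differing */
            int s1[] = { 101, 204, 317 };   /* main > solver > kernel_a */
            int s2[] = { 101, 204, 318 };   /* main > solver > kernel_b */
            printf("alignment score: %d\n", align(s1, 3, s2, 3));  /* 3 */
            return 0;
        }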